Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

The airway data was used here to illustrate how the histogram approach worked in R. Table 2.1 shows the data structure.

Table 2.1 A count matrix obtained after the sequencing reads have been mapped to a reference genome for the airway data. Each count is the number of sequencing reads mapped to a gene. The gene IDs and the sample IDs have been shortened by removing part of their prefixes. For instance, the full ID of gene E003 is ENSG00000000003 and the full ID of sample S08 is SRR1039508.

 S08   S09   S12   S13   S16   S17   S20   S21
 723   486   904   445  1170  1097   806   604
 467   523   616   371   582   781   417   509
 347   258   364   237   318   447   330   324
 118   102
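The ID shortening described for the table above can be reproduced with a pattern substitution. The following is a minimal sketch, not code from the book: the vector names and the regular expressions are assumptions inferred from the two examples given in the caption (E003 and S08).

```r
# Full IDs taken from the examples in the text.
full_genes   <- c("ENSG00000000003")
full_samples <- c("SRR1039508", "SRR1039509")

# Assumed rule: replace "ENSG" plus eight zeros with "E",
# and replace "SRR10395" with "S".
short_genes   <- sub("^ENSG0{8}", "E", full_genes)
short_samples <- sub("^SRR10395", "S", full_samples)

print(short_genes)
print(short_samples)
```

These patterns only fit IDs that share the same leading characters as the examples; other IDs would be returned unchanged.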

The code below was used to call hist to estimate a density function for the first replicate of this data set.

hist(log(x[which(x[,1]>0),1]),nclass=50)

In the above code, x was a prepared count matrix for this data set. The density function was estimated based on only the non-zero count values, hence which(x[,1]>0) was used. The bin number was 50. Figure 2.3 shows the histograms of the logarithm-transformed non-zero count data of two samples (SRR1039508 and SRR1039509).

Figure 2.3 Two histograms of the logarithm-transformed non-zero sequencing counts of the samples SRR1039508 and SRR1039509 for the airway data.
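The histogram call discussed above can be tried without the airway data. The sketch below is an illustration under stated assumptions: the matrix x is simulated with rnbinom (it is not the airway count matrix), and the object names nonzero, log_counts, h, total_area and h_two are hypothetical. It filters out zero counts, applies the logarithm, and checks that the histogram heights form a density.

```r
# Simulated stand-in for the prepared count matrix x (assumption:
# negative binomial counts, 1000 genes by 8 samples, some zeros).
set.seed(1)
x <- matrix(rnbinom(1000 * 8, mu = 500, size = 0.5), nrow = 1000, ncol = 8)
colnames(x) <- c("S08", "S09", "S12", "S13", "S16", "S17", "S20", "S21")

# Keep only the non-zero counts of the first sample, then log-transform.
nonzero    <- x[, 1] > 0
log_counts <- log(x[nonzero, 1])

# plot = FALSE returns the histogram object without drawing it.
h <- hist(log_counts, nclass = 50, plot = FALSE)

# Density heights multiplied by bin widths should sum to 1.
total_area <- sum(h$density * diff(h$breaks))
print(total_area)

# Histogram objects for the first two samples, as in the two-panel figure.
h_two <- lapply(1:2, function(j) {
  hist(log(x[x[, j] > 0, j]), nclass = 50, plot = FALSE)
})
names(h_two) <- colnames(x)[1:2]
```

Calling plot(h_two$S08) and plot(h_two$S09) draws the two panels. Note that hist treats the requested number of bins as a suggestion, so length(h$counts) may differ from 50.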